Record: Late Soft-Round QAT + Score-First Backward-Looking TTT — val_bpb 1.1178#589
Closed
RoyiRa wants to merge 1 commit into openai:main from
Conversation
3-seed mean: 1.1178 BPB (std 0.0005), ~15.75 MB artifact, 8×H100 SXM. Novel contribution: Late Soft-Round QAT — replaces the STE identity surrogate with a sigmoid soft-round in the backward pass during the final 2% of training, giving bin-aware gradients that settle weights onto int6 grid points. Built on PR openai#414 (base model), PR openai#461 (TTT recipe), and PR openai#493 (LeakyReLU²).
Robby955 added a commit to Robby955/parameter-golf that referenced this pull request on Mar 24, 2026:
- Base: PR openai#589 architecture (11L GEPA, VE128, XSA, SWA, Late QAT)
- New: Empirical Bayes Adaptive TTT (per-layer gradient SNR scaling)
- New: Embedding freeze during TTT
- Result: 1.1185 BPB on 8xH100 SXM (6909 steps, 15.81 MB artifact)
Contributor
This looks valid, but at the time of merging the chronological SOTA is PR #549, which this beats by less than the required 0.005 nats threshold.
Summary
Novel Contribution: Late Soft-Round QAT
Standard STE quantization-aware training uses hard rounding in the forward pass and an identity surrogate in the backward pass, which provides no bin-aware gradient signal near quantization boundaries. During the final 2% of training we replace that surrogate with a temperature-controlled sigmoid soft-round, giving the optimizer a non-zero, bin-aware gradient that encourages weights to settle onto nearby int6 grid points just before EMA/SWA finalization.
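As a minimal sketch of the idea (the exact surrogate, temperature schedule, and scale handling in the PR may differ; `quantize_int6` and its defaults are illustrative assumptions): a sigmoid soft-round keeps the hard round in the forward pass via the detach trick, while gradients flow through the differentiable soft-round:

```python
import torch

def soft_round(x, temperature=0.2):
    # Sigmoid-based soft rounding: approaches round(x) as temperature -> 0,
    # and its derivative peaks near bin boundaries (bin-aware gradient).
    f = torch.floor(x)
    r = x - f  # fractional part in [0, 1)
    return f + torch.sigmoid((r - 0.5) / temperature)

def quantize_int6(w, scale, temperature=0.2):
    # Forward pass: hard round (what the quantized model will actually see).
    # Backward pass: gradient of the soft-round surrogate (detach trick),
    # clamped to the signed int6 range [-32, 31].
    x = w / scale
    soft = soft_round(x, temperature)
    hard = torch.round(x)
    q = soft + (hard - soft).detach()
    q = torch.clamp(q, -32, 31)
    return q * scale
```

Unlike the STE's identity surrogate, the soft-round derivative is largest for weights sitting near a quantization boundary, so the late-training gradient actively pushes them toward the nearest grid point.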
Score-First Backward-Looking TTT
Backward-looking adaptation following PR #461: validation tokens are split into ~1,893 non-overlapping 32K-token chunks. Each chunk is first scored under torch.inference_mode(), then the model is trained on it with SGD (cosine-decayed lr=0.002, momentum=0.9, 3 epochs). Chunk N is therefore scored by a model adapted only on chunks 0..N-1.

Results
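The score-first backward-looking loop from the TTT recipe above can be sketched as follows (a simplified skeleton, not the PR's code: `loss_fn` and the chunk format are placeholder assumptions, and per-chunk epoch/scheduler bookkeeping may differ in the actual recipe):

```python
import torch

def score_first_ttt(model, chunks, loss_fn, lr=0.002, momentum=0.9, epochs=3):
    # Score-first, backward-looking TTT: each chunk is scored BEFORE the
    # model trains on it, so chunk N's score reflects adaptation only on
    # chunks 0..N-1 (no leakage from the chunk being evaluated).
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=len(chunks) * epochs)  # cosine-decayed lr over all steps
    losses = []
    for chunk in chunks:
        with torch.inference_mode():          # score first
            losses.append(loss_fn(model, chunk).item())
        for _ in range(epochs):               # then adapt on this chunk
            opt.zero_grad()
            loss_fn(model, chunk).backward()
            opt.step()
            sched.step()
    return sum(losses) / len(losses)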
Credits